Driving a 32-Bit RISC Processor in an FPGA
By Yanzhe Liu and Greg Kahlert
Integrated System Design
Posted 06/08/01, 03:09:50 PM EDT
GDA Technologies (San Jose, CA) is a design-for-hire engineering services firm that specializes in ASIC designs. Increasingly, however, our ASIC customers want to prototype in FPGAs before committing to
silicon. We had a recent contract to help a client achieve a high-speed CPU design in an FPGA. What we learned from the engagement may be helpful to others attempting a similar development.
Our client contracted us to port an optimized 32-bit RISC processor into a Xilinx (San Jose, CA) Virtex XCV1000 FPGA, with the requirements that it run at a minimum of 75 MHz and occupy less than 40 percent of the device. The final
design ended up consuming about 80k of the 200k gates in the FPGA. We worked with the
customer's specified intellectual property (IP) core vendor, Lexra Inc. (San Jose, CA), and ported Lexra's LX4189 processor core to the XCV1000 in two months' time. Lexra optimized the IP core to better fit into an FPGA. The final design beat the 75-MHz goal; we were able to demonstrate an 80-MHz clock speed.
We used Xilinx's Alliance Version 3.1 tool kit for hierarchical block-based place and route. During the course of the design effort, Xilinx offered numerous suggestions as to how we could use their tools to make the design faster:
incremental compilation, multi-pass runs, a hierarchical methodology, and so forth. The multiple place-and-route passes produce different results, and we were able to choose the run that best met our objective (see the example below). We also used Amplify, an
RTL floorplanning tool from Synplicity, Inc. (Sunnyvale, CA), as well as their FPGA
synthesis tool, Synplify.
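As a rough sketch of what such a multi-pass run looks like, the Alliance par command can sweep several placer cost tables in one invocation and keep the best-scoring results. The file names here are illustrative, and the exact options should be checked against the tool documentation:

    par -n 10 -t 1 -s 3 lx4189_map.ncd results.dir lx4189.pcf

This asks for ten place-and-route passes starting at cost table 1 (-t 1) and saves the three best results (-s 3), from which the implementation with the best timing can be picked.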
The original RTL database (see Figure 1) was coded for an ASIC implementation, not for instantiation into an FPGA. In addition, the core's configuration was likewise tailored for an ASIC, not an FPGA environment. In attacking the problem, we ran the LX4189 RTL code through the tools to get a data point from which to begin our work. Using the Amplify floorplanner and the Synplify synthesis tool, we created a netlist that we supplied to the Xilinx place-and-route tool to produce a layout in the Virtex XCV1000.
The resulting first layout achieved a 50-MHz clock speed; the worst path in this layout had a delay of 20 nanoseconds. The rule of thumb for FPGAs is that, ideally, 60 percent of a path's delay should be in logic and 40 percent in routing. Our first pass had resulted in 30 percent of the delay in logic and
70 percent in routing. The plan was to squeeze the routing delay down into the range of the logic delay to achieve a total path delay of about 12 nanoseconds, worst case, which translates into a 75- to 80-MHz clock speed.
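The arithmetic behind that plan, using the numbers above, looks like this:

    logic delay:   0.30 x 20 ns = 6 ns   (left essentially unchanged)
    routing delay: 0.70 x 20 ns = 14 ns  (to be squeezed to roughly 6 ns)
    total path:    about 12 ns, worst case
    clock speed:   1 / 13.3 ns = 75 MHz; 1 / 12.5 ns = 80 MHz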
To begin the process of improving the clock speed to our target of 75 MHz, we evaluated the paths that were causing the greatest delay. Our initial analysis of the core showed complex data paths
containing chains of multiplexors (MUXs; see Figure 2). These produced large net delays when implemented in the FPGA. In fact, 70 to 75 percent of the net delays could be attributed to the data path.
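To illustrate the kind of structure involved (the signal names here are hypothetical, not taken from the LX4189 source), a chain of nested conditional assignments like the following synthesizes to 2:1 MUXes in series, so a late-arriving operand passes through several levels of logic and several long routes:

    // Hypothetical operand-selection path: each nested conditional adds a
    // 2:1 MUX in series, accumulating logic levels and routing delay.
    module operand_select (
        input  [31:0] alu_q, mem_q, cp0_q, bypass_q,
        input         sel_alu, sel_mem, sel_cp0,
        output [31:0] operand
    );
        assign operand = sel_alu ? alu_q :
                         sel_mem ? mem_q :
                         sel_cp0 ? cp0_q :
                                   bypass_q;
    endmodule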
In addition, as we got into the design, we identified excess logic in the coprocessor address path, as well as register files, that might, if eliminated, increase the FPGA clock speed. For example, the core had a MIPS16 module that was not needed in this implementation. However, after Lexra had removed this module as well as other unneeded logic elements and provided us a new RTL database, we did not see a major improvement in clock speed. The delay was still in the data paths.
We took a look at some obvious problems that might have prevented us from achieving a higher clock
frequency. These included multiplexor implementation and whether a tri-state MUX is better than some other form of MUX. We made use of the BUFTs common in Virtex CLBs (configurable logic blocks). BUFTs are 3-state buffers that drive dedicated,
segmented horizontal-routing resources.
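A minimal sketch of the tri-state alternative, again with hypothetical signal names: each conditional assignment to high impedance infers a 3-state buffer, which the Xilinx tools can map onto the Virtex BUFTs and their dedicated horizontal lines, replacing the serial MUX chain with parallel drivers onto a single bus. The select signals must be one-hot to avoid bus contention:

    // Hypothetical tri-state MUX: four parallel BUFT-style drivers on one
    // bus instead of three 2:1 MUXes in series. Selects must be one-hot.
    module operand_select_tbuf (
        input  [31:0] alu_q, mem_q, cp0_q, bypass_q,
        input         sel_alu, sel_mem, sel_cp0, sel_bypass,
        output [31:0] operand
    );
        assign operand = sel_alu    ? alu_q    : 32'bz;
        assign operand = sel_mem    ? mem_q    : 32'bz;
        assign operand = sel_cp0    ? cp0_q    : 32'bz;
        assign operand = sel_bypass ? bypass_q : 32'bz;
    endmodule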
Fanout was another problem area we looked into. By minimizing fanout, we reduced the delay through a number of critical paths. However, we reached a point of diminishing returns, where constraining fanout further increased, rather than decreased, delay: the constraints we placed in the synthesis tool caused it to
insert additional gates to reduce fanout, and
those extra gates added delay of their own.
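A minimal sketch of such a constraint, assuming a made-up register name: Synplify's syn_maxfan attribute asks the tool to replicate or buffer a driver once its fanout exceeds the stated limit. Set too aggressively, the extra buffering itself adds delay, which is exactly the effect we ran into:

    // Hypothetical critical register with a Synplify fanout directive:
    // replicate the register if its fanout exceeds 16 loads.
    module fanout_limit_example (
        input         clk,
        input  [31:0] d,
        output [31:0] q
    );
        reg [31:0] q_int /* synthesis syn_maxfan = 16 */;
        always @(posedge clk)
            q_int <= d;
        assign q = q_int;
    endmodule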
The Amplify floorplanning tool produced two large blocks, coprocessor 0 (CP0) and RPA, that it then placed within the FPGA. RPA represents the arithmetic logic unit and instruction-execution pipeline logic of the core. During the design process, we produced a layout of each block independent of the other, aiming to put the two together once we had gotten each block as close to 75 MHz as possible.
Of the two blocks, CP0 had the largest number of slow paths, with timing in the range of 48 MHz to 50 MHz. With the help of Amplify, we improved this from 50 MHz to about 66 MHz, but beyond 66 MHz it was difficult to improve the timing any further, even with Amplify. Therefore, we focused our attention on fixing critical paths in both blocks. At Xilinx's suggestion, we replaced a critical group of multiplexors with tri-state multiplexors.
By identifying a set of paths with timing violations and selectively replacing the multiplexors in those paths with
tri-state multiplexors, we were able to raise the timing of the entire design to 80 MHz. Achieving an 80-MHz
design was a significant milestone, since it represents about a third of the clock speed of the processor in our ASIC implementation. As for the size of the completed design, it occupied 12 of 96 block RAMs (12 percent), 1,505 of 12,288 slices (12 percent), and 448 of 12,544 TBUFs (3 percent) of the Xilinx XCV1000.
Yanzhe Liu is a design engineer at GDA Technologies Inc. Greg Kahlert is an applications engineer at Lexra, Inc. (San Jose, CA).